NSF PAR Search | NSF Public Access Repository

When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling

Faisal, Abdullah; Martin, Noah; Bashir, Hafiz; Lamelas, Swaminathan; Dogar, Fahad (July 2024, USENIX OSDI)

In this paper, we make a case for providing job completion time estimates to GPU cluster users, similar to providing the delivery date of a package or arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing GPU schedulers optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., queue weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie around the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of scheduling ML training workloads on GPUs. Our evaluation, on a small-scale GPU testbed and larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
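To make the Weighted-Fair-Queueing idea concrete, here is a minimal sketch of a self-clocked WFQ variant over job queues. It is an illustration only, not the PCS implementation: the class name `WFQScheduler`, the queue names, the weights, and the job "size" units are all hypothetical, and the simulation-aided search over configurations described in the abstract is not shown.

```python
import heapq


class WFQScheduler:
    """Toy self-clocked weighted fair queueing (hypothetical, not PCS's code)."""

    def __init__(self, weights):
        self.weights = dict(weights)  # queue name -> weight (assumed parameters)
        self.last_finish = {q: 0.0 for q in self.weights}
        self.virtual_time = 0.0
        self._heap = []  # entries: (finish_tag, arrival_seq, job_id)
        self._seq = 0

    def submit(self, queue, job_id, size):
        # A job starts (in virtual time) once its queue's previous job finishes.
        start = max(self.virtual_time, self.last_finish[queue])
        # A larger weight shrinks the virtual service time of the job.
        finish = start + size / self.weights[queue]
        self.last_finish[queue] = finish
        heapq.heappush(self._heap, (finish, self._seq, job_id))
        self._seq += 1

    def next_job(self):
        # Serve the job with the smallest virtual finish tag.
        if not self._heap:
            return None
        finish, _, job_id = heapq.heappop(self._heap)
        self.virtual_time = finish
        return job_id


sched = WFQScheduler({"short": 3, "long": 1})
sched.submit("long", "A", 6)
sched.submit("short", "B", 6)
sched.submit("short", "C", 6)
sched.submit("long", "D", 6)
order = [sched.next_job() for _ in range(4)]
print(order)
```

Changing the weights changes how quickly each queue drains, which is the kind of knob a configuration search could tune against a predictability target.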
Full Text Available
Divided at the Edge - Measuring Performance and the Digital Divide of Cloud Edge Data Centers

https://doi.org/10.1145/3629138

Martin, Noah; Dogar, Fahad (November 2023, Proceedings of the ACM on Networking)

Cloud providers are highly incentivized to reduce latency. One way they do this is by locating data centers as close to users as possible. These “cloud edge” data centers are placed in metropolitan areas and enable edge computing for residents of these cities. Therefore, which cities are selected to host edge data centers determines who has the fastest access to applications requiring edge compute, creating a digital divide between those closest to and those furthest from the edge. In this study we measure latency to the current and predicted cloud edge of three major cloud providers around the world. Our measurements use the RIPE Atlas platform targeting cloud regions, AWS Local Zones, and network optimization services that minimize the path to the cloud edge. An analysis of the digital divide shows rising inequality as the relative difference between users closest to and farthest from cloud compute increases. We also find this inequality unfairly affects lower-income census tracts in the US. This result is extended globally using remotely sensed nighttime lights as a proxy for wealth. Finally, we demonstrate that low Earth orbit satellite internet can help to close this digital divide and provide fairer access to the cloud edge.
Full Text Available
Judicious QoS using Cloud Overlays

Haq, Osama; Doucette, Cody; Byers, John W.; Dogar, Fahad (December 2020, Computer Communication Review)

We revisit the long-standing problem of providing network QoS to applications, and propose the concept of judicious QoS -- combining the cheaper, best effort IP service with the cloud, which offers a highly reliable infrastructure and the ability to add in-network services, albeit at higher cost. Our proposed J-QoS framework offers a range of reliability services with different cost vs. delay trade-offs, including: i) a forwarding service that forwards packets over the cloud overlay, ii) a caching service, which stores packets inside the cloud and allows them to be pulled in case of packet loss or disruption on the Internet, and iii) a novel coding service that provides the least expensive packet recovery option by combining packets of multiple application streams and sending a small number of coded packets across the more expensive cloud paths. We demonstrate the feasibility of these services using measurements from RIPE Atlas and a live deployment on PlanetLab. We also consider case studies on how J-QoS works with services up and down the network stack, including Skype video conferencing, TCP-based web transfers and cellular access networks.
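The coding service can be pictured with a small sketch. The abstract does not say which code J-QoS uses, so the example below assumes a plain XOR parity code, one of the simplest ways to combine packets from several streams into a single coded packet. The function names and the sample payloads are hypothetical.

```python
def xor_encode(packets):
    """Combine packets (bytes) into one coded packet, zero-padding shorter ones."""
    coded = bytearray(max(len(p) for p in packets))
    for packet in packets:
        for i, byte in enumerate(packet):
            coded[i] ^= byte
    return bytes(coded)


def xor_recover(coded, received, lost_length):
    """Recover the single missing packet from the coded packet and the others."""
    missing = bytearray(coded)
    for packet in received:
        for i, byte in enumerate(packet):
            missing[i] ^= byte
    # The receiver is assumed to know the lost packet's length (e.g. from a header).
    return bytes(missing[:lost_length])


# Three application streams each send one packet over the best-effort path.
packets = [b"skype-frame", b"tcp-seg", b"cell"]
coded = xor_encode(packets)  # one coded packet sent over the cloud path

# Suppose the second packet is lost on the Internet path.
recovered = xor_recover(coded, [packets[0], packets[2]], len(packets[1]))
print(recovered)
```

Under this assumption, one coded packet on the more expensive cloud path protects several streams at once, but it can only repair a single loss per coded group.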
Full Text Available
The Effects of Network Outages on User Experience in Augmented Reality Based Remote Collaboration - An Empirical Study

https://doi.org/10.1145/3476054

Ahsen, Tooba; Lim, Zi Yi; Gardony, Aaron L; Taylor, Holly A; Ruiter, Jan P de; Dogar, Fahad (October 2021, Proceedings of the ACM on Human-Computer Interaction)

Augmented Reality (AR) applications can enable geographically distant users to collaborate using shared video feeds or interactive 3D holograms, and may be particularly useful in the socially distant context of the Covid-19 pandemic. However, a good user experience is key to their success and could be negatively impacted by network impairments, which are inevitable in today's best-effort Internet. In this paper, we present the findings of an empirical user study aimed at understanding the effects of network outages on user experience and behavior in a collaborative AR task. We highlight how network outages affected users in different ways depending on their role in the collaborative task, and how giving users explicit information about poor network conditions helped them deal with some of these negative effects. Furthermore, we report the strategies that users themselves adopted to deal with outages, such as batching instructions or shifting to a different spatial referencing style when communicating with their partners. Lastly, based on our findings, we present some design implications for future remote-collaborative AR applications.